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Translation of results from genetic findings to inform medical practice is a highly 
anticipated goal of human genetics. The aim of this paper is to review and discuss the role 
of genetics in medically-relevant prediction. Germline genetics presages disease onset 
and therefore can contribute prognostic signals that augment laboratory tests and clinical 
features. As such, the impact of genetic-based predictive models on clinical decisions 
and therapy choice could be profound. However, given that (i) medical traits result from a 
complex interplay between genetic and environmental factors, (ii) the underlying genetic 
architectures for susceptibility to common diseases are not well-understood, and (ill) 
replicable susceptibility alleles, in combination, account for only a moderate amount of 
disease heritability, there are substantial challenges to constructing and implementing 
genetic risk prediction models with high utility. In spite of these challenges, concerted 
progress has continued in this area with an ongoing accumulation of studies that 
identify disease predisposing genotypes. Several statistical approaches with the aim of 
predicting disease have been published. Here we summarize the current state of disease 
susceptibility mapping and pharmacogenetics efforts for risk prediction, describe methods 
used to construct and evaluate genetic-based predictive models, and discuss applications. 
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INTRODUCTION AND BACKGROUND 

Multiple lines of evidence strongly support the notion that the 
large majority of common, chronic diseases have complex causes. 
Environmental components such as infection, caloric flux, and 
chemical exposure, along with heritable elements such as DNA 
variants, methylation patterns, and epigenetic RNA effects, are 
interacting co-conspirators resulting in common diseases. In 
this background of convoluted and entangled etiology, discov- 
ery and use of disease predisposing alleles present a consid- 
erable challenge to the human genetics community (Clerget- 
Darpoux and Elston, 2013). Recent technological advances in 
high-throughput genotyping, RNA expression, and massively 
parallel sequencing have accelerated interrogation of genetic vari- 
ation for the purpose of understanding human disease and drug 
response. Among the more important uses of these discover- 
ies is providing detailed, mechanistic insight into the molec- 
ular pathogenesis of disease states. The two primary avenues 
of utilizing this explosion in genetic information for the pur- 
pose of improving clinical practice are in (1) drug development 



^This is inspired by a humorous quote that is variably attributed to Mark 
Twain, Niels Bohr, the Danish Parhament, Samuel Goldwyn, and Yogi Berra. 



stemming from the identification of molecular targets and (2) the 
prediction of disease susceptibility, pharmacogenetic response, 
and disease severity/trajectory (Khoury et al., 1985; Holtzman 
and Marteau, 2000; Evans and Relling, 2004). Although only 
a small minority of current pharmaceuticals originated directly 
from genetic findings serving as drug targets, the list is expanding 
and includes inflammatory cytokine-based monoclonal antibod- 
ies and targeted cancer therapeutics, among others. These ther- 
apeutics often target specific biochemical pathways to improve 
clinical treatment, often with a reduction in adverse reactions. 
Disease prediction and diagnosis with genetic testing is a broad 
field with diverse applications, ranging from karyotyping for 
chromosomal abnormalities to enhancement of disease risk pro- 
files using single nucleotide polymorphisms (SNPs) previously 
found to be disease-susceptibility markers, such as HFE missense 
polymorphisms which can lead to hemochromatosis, or the vari- 
ants in the tumor suppressors BRCAl and BRCA2 that increase 
risk to breast and ovarian cancers. Clinical genetics testing can 
provide physicians with an additional tool for better diagnosis 
and improved medical care. 

Much of the variation in disease course, severity, and response 
to medication is reflective of the underlying allelic repertoire 
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existing in each individual, offering the opportunity for genet- 
ics to facilitate early treatment, preventative medicine, preemptive 
selection of efficacious drugs, and more accurate estimation of 
risk for those thought to be at intermediate risk using traditional 
factors. As the cost and complexity of medical care escalates, 
the promise of human genetics to provide directly actionable, 
individualized information to address impediments to optimal 
and cost-effective medical practice carries increasing weight and 
urgency (Chen and Snyder, 2013). This review has multiple aims: 
(1) provide a brief overview of the current state of human dis- 
ease mapping as this provides the foundational knowledge for 
genetic-based disease prediction, (2) describe the process of dis- 
ease prediction in a simple probabilistic framework detailing the 
general qualities of clinically useful predictive models and also 
detailed examples, (3) provide an overview of the basic classes of 
genetic-based prediction models and measures of prognostic util- 
ity, and (4) illustrate the application of genetic-based predictive 
models to data from biobanks and prospective cohorts. 

Identification of replicated susceptibility variants provides 
considerable material for understanding biochemical pathways 
that govern diseases, particularly when the variants reside within 
the coding or regulatory regions of well-understood genes and are 
validated by functional studies (Manolio, 2010). Unfortunately, 
many disease-associated variants are located in regions of the 
genome that have not yet been functionally characterized. 
Indeed, 39% of the National Human Genome Research Institute 
(NHGRI) Genome-Wide Association Study (GWAS) catalog 
SNPs are annotated as intergenic and more than 36% are reported 
as intronic (Welter et al., 2014). The genes and pathways dis- 
covered can become targets for pharmaceutical intervention, 
especially when integrated with corroborating studies from dis- 
ease models, signal transduction experiments, bioinformatics, 
and protein biochemistry. Examples of using specific genes or 
their products as pharmaceutical targets have rapidly accumu- 
lated over the past decade and include mipomersen, an antisense 
therapeutic targeting APOB RNA for the treatment of hyper- 
cholesterolemia (Raal et al, 2010), ivacaftor which targets the 
G551D mutation in CFTR found in approximately 4% of indi- 
viduals with cystic fibrosis (Ramsey et al., 2011), inflammatory 
cytokines and their receptors (e.g., IL-ip, IL-12/23p40, IL-17A, 
IL-6R) (e.g., Krueger et al, 2007), other immune cell signaling 
proteins (e.g., CTLA-4, CD30), a variety of tumorigenesis genes 
that harbor somatic variants useful for individualized cancer 
treatment (e.g., BRAF, KRAS, and EGFR), and lipid transfer pro- 
teins (e.g., CETP, PCSK9), among many others. As more human 
genetics studies are conducted, the number of these druggable 
targets will expand. While the use of genetic results in pharmaceu- 
tical development is impressive, some of the most highly touted 
uses of genetic susceptibility data have been the accurate progno- 
sis of diseases (e.g., Mendelian and oligogenetic disorders, such 
as Tay-Sachs disease, phenylketonuria, Charcot-Marie-Tooth, or 
rare ciliopathies including polycystic kidney disease and Bardet- 
Biedl syndrome), or other areas that impact medical decisions, 
such as choice of drug, selection of dose, avoidance of side-effects, 
or determining the optimal intensity of clinical monitoring. 
Unlike identifying potential drug targets, genetic-based predic- 
tion models may serve a clinical purpose in advance of precise 



identification of the functional motifs and molecular mechanisms 
that drive genetic association/linkage signals. Instead, the util- 
ity of predictive models is derived primarily from the correlation 
patterns — provided that these are robust across intended popula- 
tions. However, the strength and robustness of the correlation are 
critical for a genetic prediction model to be clinically useful. 

As clinical decisions are specific to individuals, physicians aim 
to assess the probability of medical traits for each patient. This is 
a dynamic process where physicians update assessments as addi- 
tional relevant information, such as laboratory tests (both genetic 
and non-genetic), or changes in physiology become known. In 
this way, clinical decisions are informed as variation in an individ- 
ual's risk to disease, severity of disease and response to medication 
are progressively revealed. Thus, results from clinical tests, includ- 
ing genetic-based predictive models, are useful when they more 
accurately discern the likelihood of the medical trait (e.g., disease 
occurrence or response to medication), compared to the pre- 
test assessment. For example, if a physician had estimated that a 
patient had a 40% chance of having a particular disease prior to 
the results of a clinical test, and the 40% prediction remains unal- 
tered following the results of the test, then the clinical test and new 
prediction may be of little value. Further, whether the magnitude 
of this posterior-prior probability departure carries clinical utility 
depends on the specific application. As an illustration of this pro- 
cess, suppose a patient is referred to a rheumatologist. Prior to the 
visit, the rheumatologist may not have sufficient information to 
modify the assessment of the probability that the individual has, 
for example, rheumatoid arthritis (RA). Upon learning that the 
patient self-reported symptoms of symmetric sore joints that are 
partially remediated by non-steroidal anti-inflammatory medica- 
tion, the rheumatologist proceeds to update the probability of RA 
and of other conditions. Some diseases would increase in their 
likelihood, while others would decrease from their initial values. 
Following a standard evaluation of the classification criteria for 
rheumatoid arthritis assessing joint involvement, serology, acute- 
phase reactants, and symptom duration (Aletaha et al., 2010), 
the rheumatologist proceeds to further update the probability 
that the patient suffers from RA. Subsequent testing of a genetic 
panel of known RA-susceptibility markers, including polymor- 
phisms within HLA-DRBl, PTPN22, STAT4, CTLA4, TRAFl, 
CD40, etc., may further modify the posterior probability. This 
additional updated posterior probability may be particularly use- 
ful in situations where a definitive diagnosis was not available with 
non-genetic approaches alone. This is, of course, neither a new 
nor complete account of the diagnostic process, but it underscores 
the general nature of many medical decisions, where accumu- 
lation of information typically results in increasing accuracy in 
the appraisal of a medical trait probability for an individual. The 
process of serial refinement based on accumulating data is a hall- 
mark of the diagnostic process and, statistically, can be codified 
as Bayesian updating of the posterior probability of the trait. The 
aim of genetic-based predictive models is to augment existing lab- 
oratory, imaging, and other clinical data to improve the posterior 
probabilities (i.e., drive the posterior probabilities toward 0 or 1) 
of medical traits in a cost-effective manner. 

In the context discussed here, predictive models are methods 
designed to use clinical, analyte, genetic, or other types of data for 
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the purpose of forecasting a medical trait. Predictive models — 
including those based on genetic markers — are most beneficial 
when they yield actionable and individualized results. However, 
they are of reduced value if they only substantially modify medical 
decisions for an exquisitely small fraction of the patient popula- 
tion. Hence, the ideal genetic-based predictive model for clinical 
applications (1) substantially modifies the posterior probability 
of medical traits compared to that obtained from existing clini- 
cal assessment and tests — enough to enable changes in medical 
decisions and patient management, and (2) impacts a substan- 
tial fraction of individuals to whom it is applied and provides 
improved outcomes. While other considerations are essential, 
such as more cost-effective care and the ease of adoption and 
implementation of diagnostic tests, it is this concurrent maxi- 
mization of (1) modifying the posterior probability of the trait 
in the context of the benefits and risks of the specific medical 
decisions and (2) broad applicability that defines an archety- 
pal genetic-based predictive model. For example, genetic testing 
of CFTR mutations for cystic fibrosis is successful in that the 
recessive disease alleles have very high penetrance and the large 
majority of pathogenic mutations are covered with contempo- 
rary panels. Similarly, multi-gene tests for related rare diseases, 
each with high penetrance, can also serve as useful clinical tests 
(Rehm, 2013). 

In an attempt to develop such predictive models, many have 
used genome-wide association study (GWAS) results as they are 
a ubiquitous source of genetic information (Manolio, 2010). 
Attempts to use genetic information alone have not been as 
successful as previously hoped, with posterior probabilities that 
do not approach 1 or 0 and the vast majority of individuals 
having decidedly intermediate posterior probabilities. A seminal 
question is the extent to which genetic information can further 
modify posterior probabilities for those individuals thought to 
be of intermediate risk using traditional factors. Wray et al. offer 
an excellent review of the challenges involved in complex trait 
prediction with GWAS results (Wray et al, 2013). 

The discovery of genetic markers for the prediction of medi- 
cal traits is entirely dependent on the underlying genetic model 
that gives rise to the trait. That is, the number of loci and the 
number, frequency and penetrance of predisposing alleles deter- 
mine both the likelihood of identifying causal markers and the 
clinical utility of using those markers in a patient population. 
For example, monogenic disorders such as phenylketonuria, Tay- 
Sachs, or sickle cell anemia are likely fully penetrant with allele 
frequencies that are not exceedingly rare; and therefore genetic 
tests for such diseases have clinical applications, provided that dis- 
ease avoidance or disease-modifying treatments exist. However, 
traits like Alzheimer's disease, diabetes, or response to statins 
have etiologies that remain enigmatic. Whether or not these 
complex traits follow extremely polygenic modes of inheritance 
(i.e., weakly penetrant alleles, and several hundreds to thousands 
of loci), high locus/allelic heterogeneity (having highly pene- 
trant but unique loci and alleles involved across individuals), 
high levels of epistasis (e.g., genotypic effects that vary based 
on genetic background or other specific genotypes), ubiquitous 
epigenetic effects (e.g., methylation patterns, histone acetylation 
patterns, or transgenerational RNA artifacts affecting the trait). 



gene-environment interactions, or some combination thereof, 
directly impacts the identification of predictive markers as well 
as their utility. GWAS interrogate the common allelic architec- 
ture for disease predisposing markers exhibiting low degrees of 
allelic and locus heterogeneity, whereas sequencing-based studies 
in families can facilitate the discovery of rare disease-associated 
variants, but are not optimal for identifying ancestral disease- 
predisposing alleles. Therefore, it is reasonable to expect that 
genetic markers from GWAS may modify posterior probabilities 
across a large segment of the population, but with a muted impact 
on those probabilities. On the other hand, rare sequence variants, 
on the other hand, may have substantial impact on the posterior 
probabilities for specific individuals, but with little widespread 
effect. 

A review of the potential of genetic-based predictive mod- 
els to change medical practice in the short-term indicates that 
three areas have shown promise for improving clinical care: can- 
cer genomics, population screening for MendeUan diseases, and 
pharmacogenetics. These three areas profit from high penetrances 
of the genetic variants identified to date, though only a fraction of 
patients benefit from these tests. As these areas emerge from their 
infancy and additional genetic results accumulate, the proportion 
of individuals benefiting will likewise increase. 

PREDICTION USING TUMOR GENOMICS 

The advent of genetic testing in tumor cells, through harnessing 
the throughput and read depth of next-generation sequencing 
platforms, has enabled detailed and clinically actionable molec- 
ular pathology genetic tests for numerous cancers. Multiplex 
sequencing-based assays for biopsies compared to normal tis- 
sue are now available and have demonstrated usefulness in 
augmenting many clinical decisions. The utility of these tests 
relies on the clear relationship that has been delineated over 
the past two decades between specific driver mutations, treat- 
ment variants and cancer progression, and drug selection (Liaw 
et al., 1997; Paez et al., 2004; Agrawal et al, 2011; Walter et al, 
2012; Kandoth et al, 2013; Vogelstein et al, 2013). Intratumor 
(Gerlinger et al., 2012) and single-cell sequencing methods 
(Navin et al, 2011) offer the possibility of inferring the evo- 
lutionary history and driver mutations in clonal expansions of 
cancer cells. These techniques have been successfully applied to 
several cancers with excellent prognostic utility, for example, 
kidney cancer (Xu et al., 2012). For well-defined activating muta- 
tions such as those within BRAF (Loupakis et al., 2009; Borras 
et al., 2011), KRAS (Linardou et al, 2008) and EGFR (Lynch 
et al., 2004), the posterior probability of efficacious treatment 
selection is also high. Indeed, there seems to be a clear path 
to incorporating panels of well-defined oncogenesis, metastasis, 
and drug response variants through next-generation sequencing 
of tumors. Baylor College of Medicine, one institution among 
several offering a number of clinical genetics tests, has devel- 
oped a Cancer Gene Mutation Panel through next-generation 
sequencing that investigates 2855 known mutations within 50 
cancer-associated genes for clinical testing (http://www.bcm.edu/ 
cancergeneticslab/test_detail.cfm?testcode=9705). Other efforts 
include the UCLA Clinical Genomics Center (http://pathology. 
ucla.edu/body.cfm?id=105), the Emory Genetics Laboratory 



www.frontiersin.org 



June 2014 | Volume 5 | Article 162 | 3 



Schrodi et al. 



Genetic-based prediction of disease 



(http://genetics.emory.edu/egl/), and the Washington University 
School of Medicine (http://gps.wustl.edu/). Identification of a 
small number of specific mutations enables selective treatment 
courses to be taken with higher expected efficacy, albeit often with 
limited duration of effect due to the development of drug resis- 
tance, an expected consequence of monotherapy. For example, 
in this Baylor panel BRAF mutations are targeted, where treat- 
ment with vemurafenib and dabrafenib has demonstrated BRAF 
Val600-specific metastatic melanoma antitumor activity (Jang 
and Atkins, 2013). Over the past five years, somatic cell and tumor 
genomics has provided remarkable insights into the molecular 
pathobiology of cancers. This rapidly progressing field contin- 
ues to accumulate examples of improved treatment resulting from 
these genetic discoveries. 

PREDICTION IN SCREENING FOR MENDELIAN DISORDERS 

Equally impressive has been the progress in interrogating very 
highly penetrant alleles in population-based screens, particu- 
larly in newborns. Next-generation sequencing has enabled rapid, 
cost-effective multiplex assays that require little DNA. Given the 
high positive predictive value of these variants and the ability 
to modify clinical treatment in many of these Mendelian dis- 
orders, genetic-based prediction in this area is an efficacious 
addition to medical practice. For example, Saunders et al. recently 
showed the feasibility of screening for monogenic diseases across 
the genome within 50 h in a neonatal clinical setting (Saunders 
et al., 2012). Importantly, infants identified as having pathogenic 
genotypes (e.g., Kwan et al., 2013; Stefanutti et al, 2013) can 
receive appropriate treatment while stUl hospitalized, often avoid- 
ing life -threatening complications. Comprehensive genetic test- 
ing may preclude emotionally and financially costly pediatric 
odysseys (Kingsmore et al., 2011). In addition, the application 
of high-throughput sequencing approaches to clinically impor- 
tant, expansive gene panels can reliably identify known inherited 
pathogenic variants and new germline mutations that are poten- 
tially pathogenic, thereby driving effective early screening (Kurian 
etal, 2014). 

PREDICTION IN PHARMACOGENETICS 

Pharmacogenetics is the third area in which genetic variants can 
enable physicians to differentially prescribe certain medications 
to individuals to avoid adverse events or to modify dosing. The 
importance of these genetic variants in avoiding adverse drug 
reactions is underscored by FDA black-box warnings (http: 
//www.fda.gov/Drugs/ScienceResearch/ResearchAreas/Pharmaco 
genetics/ucm083378.htm), as well as by recommendations 
of other groups (https://www.pharmgkb.org/). For example, 
individuals carrying the HLA-B*5701 allele are warned against 
taking abacavir (Mallal et al., 2008), dapsone-treated patients 
with certain G6PD variants are at higher risk for hemolysis as 
are patients receiving primaquine, many sulfonamides, nitro- 
furantoin, acetanUide, niridazole, and naphthalene (Cappellini 
and Fiorelli, 2008), and the beta blocker propranolol can cause 
adverse reactions in those with variants conferring compromised 
CYP2D6 function (Cascorbi, 2003; Samer et al, 2013). In all, 
the FDA currently lists 155 pharmacogenetic warnings across 
numerous therapeutic areas. Again, the validated, high positive 



predictive value of these pharmacogenetic variants makes 
immediate clinical utility possible, if not immediately actionable. 
Clinically useful genetic variants underlying other pharmacoge- 
netic traits, such as differential response to many lipid-modifying 
medications, metformin, and anti-TNF therapies, still remain 
largely abstruse. 

Importantly, the setting of clinical application of genetic tests 
is critical to the usefulness of genetic-based predictive models. 
Traits, including drug response and adverse reactions, that are 
(1) otherwise easily diagnosed, or (2) for which disease man- 
agement would not change with the results of the predictive 
model, are poor candidates for these predictive models. So, results 
from the use of genetic-based predictive models must serve as 
a cog in the health management machinery and clearly satisfy 
an unmet medical need. For example, genetic-based predictive 
models are unlikely to play a useful role in diagnosing a bone frac- 
ture. Similarly, Kimmel et al. recently showed that even though 
genetics can fairly accurately predict warfarin sensitivity, this 
information offers no benefit over clinical management of war- 
farin dosing to achieve therapeutic range (Kimmel et al., 2013). 
The setting of medical care also plays an important role: nearly 
half of all patients are not treated in coagulation centers, leaving 
the question of how diagnostic genetic testing would fare in those 
environments. 

Why is it that these three above areas have enjoyed more suc- 
cess in applying genetic information to clinical practice than other 
applications, such as prognosis of complex diseases? In large part, 
the answer lies in the relatively low complexity of the genetic 
architecture behind these medical traits. The propagation of can- 
cer cells, tumor survival and metastasis are promoted by specific 
mutations that wield strong effects on promoting clonal expan- 
sions: driver mutations. Different driver mutations accomplish 
this task in different ways, but each driver mutation has pro- 
found effects on cellular metabolism, mitosis, and proliferation. 
Because the effects of these driver mutations are profound and 
characteristic of specific molecular pathophysiologies, it is not 
surprising that they are reasonably predictive of disease trajectory 
and chemotherapy response. Similarly, provided that the false 
positive rate of prognostic tests is low, population-based screens 
for Mendelian disorders have been a useful addition to modern 
medical practice because the penetrance of such traits is typically 
complete or nearly complete. That is, aside from the measure- 
ment error rates, the prediction of disease given a positive genetic 
test is accurate and reliable. Finally, although not as definitive as 
Mendelian disorders, pharmacogenetic effects identified to date 
testify to reduced complexity of these traits. This is particularly 
true of extreme adverse events (e.g., FDA black box warnings) 
and response with those drugs having highly targeted substrates. 
In contrast, therapeutics with multifold actions, such as statins 
or metformin, have exhibited much more recalcitrance to genetic 
dissection. 

In contrast to the above-mentioned areas, currently the predic- 
tion of common diseases presents a considerable challenge. Most 
common diseases have been relatively reluctant to reveal a large 
fraction of the genetic component of their etiologies (Manolio 
et al., 2009). Several studies of complex diseases have shown lit- 
tle improvement to disease prediction when adding genetic data 
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to already established disease risk factors (e.g., Thanassoulis and 
Vasan, 2010; Bao et al, 2013; Muhlenbruch et al, 2013); and, 
even if statistically significant, the models incorporating genetic 
information may not be clinically useful (Husing et al., 2012). 
While there are several instances of important, influential mark- 
ers that have been discovered in some common diseases, such 
as APOE in Alzheimer's, ARMS2 and CFH in age-related mac- 
ular degeneration (AMD), and numerous alleles in the MHC 
region for autoimmune and inflammatory diseases, many genetic 
linkage results are the result of multiple infrequent alleles and 
most replicated markers from GWAS have modest effect sizes. 
In combination, the replicated disease susceptibility alleles dis- 
covered thus far have yet to demonstrate substantial prognostic 
utility. That said, there are encouraging exceptions: the combined 
effect of the multiple identified loci for AMD or Crohn's disease 
may offer some clinical utility in selected circumstances. AMD 
is a leading cause of compromised vision and blindness, and 
individuals at heightened risk for AMD can benefit from more 
frequent eye exams and early treatment to curb the likelihood 
of permanent ophthalmic damage. Administration of anti-VEGF 
monoclonal antibodies have shown efficacy in exudative AMD 
treatment (Fung et al, 2007; Heier et al, 2012). Recent GWAS 
studies in AMD have demonstrated that the 19 top AMD risk loci 
are estimated to explain between 15 and 65% of the genetic por- 
tion of the variance in the phenotype (the proportion depends 
on the assumption of AMD prevalence being between 0.01 and 
0.10). This set of SNPs also generates an area under the ROC 
curve (AUG) of 0.74 (Fritsche et al., 2013), meaning that if you 
choose pairs of people at random, one with and one without 
AMD, and used their SNP data, the one with a higher probabil- 
ity of AMD would in fact be the one with AMD 74% of the time 
(Berrar and Flach, 2012). Incorporation of other known risk fac- 
tors, such as age and smoking, further improves this prediction. 
It is possible that other measures, including positive and negative 
predictive values or those based on posterior probability distri- 
butions, could provide better insight into clinical utility. Another 
promising area is the use of all genetic variants genotyped in a 
genome-wide array to construct predictive models, rather than 
restricting the markers to those that are strongly associated with 
the trait. Purcell et al. investigated the use of thousands of com- 
mon alleles in predictive models for schizophrenia and bipolar 
disorder, demonstrating an increase in the proportion of the max- 
imum variance in these traits explained as the trait-association 
significance level was relaxed (Purcell et al, 2009). In addition, 
analysis of the Wellcome Trust Case Control Consortium data 
for Crohn's disease appeared to indicate that expansion of the 
number of SNPs in a predictive model over just those reach- 
ing genome-wide significance improves the model performance 
(Kooperberg et al., 2010). These are interesting observations and 
consistent with results from Yang et al. (2010) and Lee et al. 
(201 1) describing the rather dramatic increase in the proportion 
of heritability explained with all GWAS SNPs compared to top 
SNP findings. Wei et al. offer a recent example of harnessing this 
effect for Crohn's disease where expanding the number of variants 
and using advanced machine learning techniques increased pre- 
dictive accuracy (Wei et al, 2013). With greater resolution, Yang 
et al. (2011) showed that the length of chromosomes is linearly 



correlated with the percentage of the variance attributable to a 
variety of phenotypes, including von Willebrand factor, height, 
BMI and QT interval. However, both theoretical and applied work 
appears to show limited utility of including more than a few 
hundred SNPs in commonly-used predictive models (Wu et al., 
2013; Warren et al., 2014). Nonetheless, methods that exploit the 
whole genome for disease prediction, such as extensions to Best 
Linear Unbiased Prediction (BLUP), continue to develop and may 
improve accuracy metrics for both binary disease and quantitative 
traits (Zhang et al, 2014). 

UTILITY AND METHODS 

Given that for many diseases with effective treatments accurate 
prediction of potential disease can play a critical role in deter- 
mining robust clinical care that may avert severe disease or even 
disease onset, it is essential to characterize the important aspects 
that produce useful predictive models. For traits with polygenic 
etiologies, methods must be used to combine signals from multi- 
ple genetic markers together into a cohesive metric for prediction 
(Wimmer et al., 2013). Seven main considerations when doing 
so are: (1) which genetic markers are to be included in the pre- 
dictive model — i.e., feature selection, (2) the frequencies of the 
susceptible/protective genotypes at each selected marker, (3) the 
strength of the correlation between the genotypes at each marker 
and the predicted trait, (4) the interactions between the effects 
sizes of different genetic markers, (5) the prevalence of the trait 
being predicted, (6) how the genetic data are envisioned to inte- 
grate into clinical practice in combination with non-genetic tests, 
and (7) a determination of the robustness of the prognostic 
signal across multiple populations, including those with varied 
ancestries. Over the past decade, several methods have been pro- 
posed to accomplish these tasks, including genetic risk scores, 
various types of regression-based approaches, Bayesian networks, 
and other machine learning methods. Importantly, polygenic 
disease-prediction models may serve as instrumental variables 
for Mendelian randomization analyses in the investigation of the 
causal role of genetic-based predictors in disease (Burgess and 
Thompson, 2013). 

FEATURE SELECTION 

Feature selection refers to the decision about which genetic vari- 
ants are most effective in determining the medical trait and 
should therefore be included in a predictive model. For example, 
it would seem reasonable to include SNPs in the CFH, ARMS2, 
C3, and C2/CFB/SKIV2L regions in a model predicting AMD 
because the evidence for correlation between AMD and these 
variants is both substantial and well-established. Further, selec- 
tion of these variants for inclusion in a predictive model would 
be prioritized over other variants with little or no evidence of 
utility in AMD prediction. Jakobsdottir et al. have investigated 
the properties of individual disease-susceptibility SNPs, showing 
that SNPs with highly significant odds ratios may be insuffi- 
cient to classify individuals (Jakobsdottir et al., 2009). There are 
several different methods that can be employed. For a general 
review see Guyon and Elisseff (2003). Care must be taken when 
internal validation techniques are applied to datasets, as the fea- 
ture selection must be incorporated in the internal validation 
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routine. Ideally, feature selection should be replicated in an inde- 
pendent sample set. Approaches based on stepwise selection of 
features are popular. The performance of models constructed 
based on a stepwise selection can be evaluated based on model 
fit, accounting for the complexity of the model — the Akaike and 
Bayesian information criteria are examples of measures to do 
this (Akaike, 1974; Schwartz, 1978). Aside from purely statisti- 
cal and computational approaches, use of biological information 
can improve the selection of genetic markers. By integrating 
information from numerous decades of biochemistry, molecular 
biology and cellular physiology — the direct phenotypes of genetic 
variants — one can construct predictive models weighted toward 
those variants segregating in functionally relevant regions in an 
effort to improve the robustness of the model and ease of appli- 
cation to related phenotypes. For example, if one is generating 
a genetic-based predictive model for Crohn's disease response to 
IL- 1 7 monoclonal antibody therapy, higher prioritization of vari- 
ants within IL-17-related genes or those polymorphisms that are 
known to modify T-helper cells expressing IL-17 (Thl7) activity 
may provide complementary information and yield a higher like- 
lihood of the test having utility when applied to other populations 
or related phenotypes. 

GENETIC RISK SCORES 

Genetic Risk Scores (GRS), determined simply on the basis 
of published GWAS results, are among the simplest meth- 
ods employed for genetic prediction. The majority of these 
approaches construct the predictive model based on the sum 
of predisposing genotypes that each individual carries, either 
unweighted or weighted by the effect size of the specific predis- 
posing genotypes. The essential approach is to take a weighted 
sum of risk alleles, choosing the risk alleles based on those found 
to be genome-wide significant in a recent meta-analysis (e.g., 
for BMI, see Speliotes et al., 2010). Weights are determined for 
each risk allele by the P estimates from the meta-analyzed GWAS. 
Unweighted GRS treat each risk locus equally. To illustrate the 
weighted GRS approach, assume that k SNPs are known to be 
genome-wide significant and further assume that the correspond- 
ing P weights from the GWAS are denoted as w, for the rth 
SNR Then the be calculated as: GRS = 2^,- w,Ri; where Ri is the 
number of risk alleles at the ith SNR Speliotes et al, using 32 
confirmed obesity-associated loci, showed the distribution of the 
weighted number of risk alleles across the population used in 
the Atherosclerosis Risk in Communities (ARIC) study, and pre- 
sented a corresponding AUG for the GRS (Spehotes et al., 2010). 
Although significantly different from that expected under the 
null, the AUG for this example was exceedingly modest (0.515), 
where flipping an unbiased coin would be expected to have an 
AUG of 0.500. In another example, Ripatti et al. developed a 
genetic risk score based on 13 SNPs discovered to be associated 
with coronary heart disease, myocardial infarction or both, in 
seven reports (Ripatti et al, 2010). For each individual, the effects 
of these SNPs were combined by summing the number of risk 
alleles and the resulting risk score was partitioned into quin- 
tiles for the purpose of creating a categorical variable. Comparing 
extreme quintiles, the authors found roughly a 1.7-fold increased 
risk for coronary heart disease in the top risk quintUe compared 



to the lowest risk. The genetic risk score did not show a significant 
effect of the net reclassification of individuals over traditional risk 
factors and family history. The combined genetic effect was able 
to slightly improve the risk classification of those individuals who 
were previously thought to have intermediate risk as determined 
by traditional risk factors, but may not have strong clinical util- 
ity. Increasing the number of informative genotypes and/or the 
traditional risk factors may improve the prognostic performance 
of GRS. Other applications, including age-related macular degen- 
eration, exhibit more promising performance (Grassmann et al., 
2012; Seddon et al., 2014). 

REGRESSION METHODS 

Regression methods, familiar tools for constructing prediction 
models for both dichotomous and quantitative traits, can lead 
to more general predictive models than simple GRSs. One of 
the first reports of a cohesive method using multiple replicated 
markers under a general logistic regression model was devel- 
oped by Yang et al. (2003). Yang and coworkers proposed using 
a general logistic regression model to estimate the ratio of the 
probability of the genotype information given disease to the 
probability of the genotype information within the non-diseased 
population. Incorporation of covariates and interaction effects 
are possible with this generalized form. Currently, regression 
is still commonly used for disease prediction. For example, a 
search of PubMed revealed 10 articles published in 2013 which 
applied regression methods for the prediction of a variety of 
diseases, including cerebrovascular disease, age-related macular 
degeneration, and hypertrophic cardiomyopathy (Abraham et al, 
2013; Borque et al., 2013; Gruner et al, 2013; Harada et al, 
2013; Mondul et al, 2013; Romano et al., 2013; Schellekens 
et al, 2013; Sharma et al, 2013; Tsai et al, 2013; Uddin et al, 
2013). In addition, extensions including regression of the whole 
genome using a Best Linear Unbiased Prediction method (G- 
BLUP) can produce more highly predictive models (de Los 
Campos et al., 2013). Importantly, Yang et al. (2009) pointed 
out that one should not rely on point estimates alone, but also 
provide a measure of the uncertainty of the risk estimates. Risk 
estimates depend on a variety of parameters, each of which 
may be estimated with some uncertainty. Cumulative uncertainty 
across all estimated parameters leads to uncertainty of the risk 
estimates. 

There are several modeling assumptions made when applying 
either linear or logistic regression but, in the specific application 
area of genetics, the following concerns should be emphasized. 
First, multicoUinearity between nearby markers is usually a seri- 
ous concern. For markers in high linkage disequilibrium with 
each other, it is common to select the variant with the lowest p- 
value for inclusion in the model. Principal component regression 
is another useful way to address concerns arising from mul- 
ticoUinearity. For example, Gauderman et al. found that this 
approach performs well when applied to a single candidate gene 
(Gauderman et al., 2007). Another concern is marker-marker 
interactions. For parsimony, it is common practice to ignore 
interactions. Interaction analysis is not easy to conduct and can 
be computationally intensive. Furthermore, substantially larger 
sample sizes are typically needed to detect interaction effects than 
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are needed to detect main effects. However, ignoring interac- 
tions may underestimate genetic effects, and improvements in 
the understanding of interactions would be expected to improve 
genetic risk prediction models (Thanassoulis and Vasan, 2010). 
Missing data are commonly problematic since genotype success 
rates are never perfect (Kim and Misra, 2007). One strategy is to 
drop samples with missing data (Schwender and Ickstadt, 2008). 
Otherwise, when possible, imputation can be a useful solution for 
"filling in" missing data (Yuan, 2000). 

Usually, for presence vs. absence of disease phenotypes, a 
predictive model is first developed by analyzing a case-control 
dataset, and then applied to a particular population. To provide 
risk estimates that are calibrated to that particular population, an 
adjustment which depends on the case to control ratio must be 
made to the intercept term of the case-control regression model 
(Yang et al, 2003). 

Many studies, but not all (Warren et al., 2014), indicate that 
risk prediction would be more accurate if more predictors could 
be added in the risk model (De lager et al., 2009; van Dieren 
et al., 2012). But the confidence interval (CI) of the risk esti- 
mate is often not considered in the evaluation of the risk model. 
When the model is built using regression in a meta-analysis of 
many case-control datasets, confidence intervals are often not 
even estimated. 

Provided it is unbiased, a more precise risk estimate with a 
smaller CI from a model with fewer predictors is better than a 
less precise risk estimate with a larger CI from a model with 
more predictors (Shan et al., 2013). To compute the CI for the 
risk estimates from a meta-analysis, each individual study in the 
meta-analysis should do a joint analysis and return coefficient 
estimates and the variance-covariance matrix for the coefficients. 
Then, these can be combined to estimate the overall variance- 
covariance matrix and a precise CI for the risk estimates. Goddard 
et al. developed a method that derives an empirical CI combin- 
ing all relevant sources of variation in disease risk (Goddard and 
Lewis, 2010; Crouch et al, 2013). 

BAYESIAN NETWORKS 

Bayesian Networks have resulted from the application of advances 
in graph theory to applied probability and carry a high degree of 
interpretability, along with providing an intuitive framework for 
obtaining posterior probabilities and the treatment of classifica- 
tion problems (Pearl, 1988; Jordan, 2004). If the features (genetic 
markers) within the Bayesian Network can be reasonably modeled 
as being conditionally independent (conditional on the disease 
trait in our application), then the network is reduced to a highly 
tractable Naive Bayes model. Given a set of n genetic markers, 
using Bayes' rule one can write the posterior probability of the 
disease trait (PPD), as: 



denote disease and D - 
the PPD as: 



0 for non-disease), one can re-write 



P{n"=iG,\D)P(D) 



where D denotes a random variable for the disease trait and n 
genetic markers are used in the prediction. Under the condi- 
tional independence assumption of Naive Bayes, we can com- 
pletely factorize the product and, for a binary trait (D = 1 to 



PPDn 



P(D 



P(D =l)U"=i P(Gi\D=^) 

l)U"=iPiG,\D^l)+ P(D^0)U"=iP{G,\D^O)' 



To illustrate this type of calculation. Figure 1 shows scaled PPD 
values for a rheumatoid arthritis study. In this study (Chang 
et al., 2008), the PPD for every possible three-locus genotype 
combination at the risk loci {HLA-DRBl, the R620W polymor- 
phism at PTPN22, and diplotypes at TRAFl) was calculated, 
and scaled such that the smallest value was set to 1; SRR denotes 
this scaled ratio (Figure 1). While there is substantial variability 
across the values for different genotype combinations: over a 
41-fold difference in predicted rheumatoid arthritis-risk, it is 
important to keep in mind how these bins are populated with 
individuals with and without the disease trait (the case-control 
frequencies given for each combination), for a prognostic loses 
general utility as intermediate combinations become frequent. 
In concrete terms, while a 41-fold difference is impressive, only 
0.1% of the general population is calculated to carry genotypes 
producing this level of effect. 3.2% have multi-locus genotypes 
that generate at least a 21 -fold increase in RA risk, and 13.7% 
carry a multi-locus genotype with >5-fold increase in RA risk 
(all compared to the lowest category). 

Figure 2 displays the results from a simplified model. Five 
hundred disease susceptibility SNPs, all having equal effect sizes 
and genotype frequencies, were modeled. A prior probability of 
disease was set to 0.20 and the predisposing genotype frequency 




FIGURE 1 I Rheumatoid arthritis scaled posterior probabilities (SRR). 

Genotype data at three strongly predisposing loci, HLA-DRBl, TRAFl, and 
PTPN22 are combined and the posterior probabilities calculated for every 
possible multilocus genotype combination. The prior probability was set to 
the approximate population prevalence of rheumatoid arthritis, 0.01. The 
posterior probabilities are scaled such that the lowest RA-risk multilocus 
genotype was set to a value of 1. The results show a 41-fold variation in 
posterior probabilities. The expected frequencies of the various multilocus 
genotype combinations in RA patients/controls are shown at the top of 
each bar. 
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FIGURE 2 I Posterior probability variation with relative risk. Tlie 
density of posterior probabilities of disease (PPD) are shown under a 
simplified multilocus disease model. The number of independent, 
disease-predisposing SNPs was set at 500. Relative risk was modeled as 
being identical for each predisposing SNR Frequency of the predisposing 
genotype in controls was set to 0.05 at each SNR Prior probability of 



disease was set at 0.20. Na'fve Bayes was used to calculate posterior 
probabilities. The data points only take on discrete values (The densities 
are composed of discrete values which are connected by lines to 
produce the curves. While the sum of the discrete values all equal one 
in each of the curves, the areas under the curves do not), but are 
presented with interconnecting lines. 



in the general population was set to 0.05 for all 100 SNPs. As 
expected, for very small effect sizes the number of individuals 
calculated to have posterior probabilities close to 0.20 is high 
and rapidly tails off. However, for larger effect sizes, there is an 
accumulation of individuals with posterior probabilities closer to 
1 and 0. Interestingly, even with quite considerable effect sizes 
(for high frequency alleles) much of the density still resides in 
the intermediate region between 0.10 and 0.90. If we explore the 
dynamics as the number of loci is increased, so does the variance 
in the posterior probability of the disease trait (Figure 3). 

Diagnosis or prognosis of disease traits with genetic informa- 
tion are classical problems of classification and clustering within 
machine learning. Hence, numerous machine learning methods, 
such as neural networks, support vector machines, and random 
forests can be applied to these types of data sets. Currently, the 
use of these methods to address problems using gene expression 
is arguably more advanced than the analogous methods applied 
to DNA variation data. 

QUANTIFYING PROGNOSTIC UTILITY 

Within a population studied, once each individual is ( 1 ) assigned 
a score for a risk metric, (2) assigned a posterior probability, 
(3) clustered or (4) classified, a method for assessing prognostic 



utility is required to quantify the usefulness in clinical practice. 
The most common method used is the area under the ROC 
curve, or AUC. However, although this metric is useful to assess 
discrimination, it is not the appropriate measure to assess a pre- 
dicted probability (Cook, 2007). Graphically, the ROC curve is 
a plot of the performance of the predictor in a space defined 
by the sensitivity (true positive rate) and 1 — specificity (false 
positive rate). Varying the threshold of calling a result positive 
or negative, a curve can be produced for the predictive model. 
The AUC is the integral of the curve. For a completely non- 
informative predictor, the AUC is 0.50, with larger values (up to 
unity) indicating improved prognostic utility (Figure 4). While 
useful, sensitivity and specificity are probabilities conditional on 
the state of the phenotype trait. One may want to consider met- 
rics that have differential performance with the prevalence of 
the disease trait. Indeed, all other diagnostic factors being equal, 
a physician should be more prone to diagnose an individual 
with a more common phenotype than an exceedingly rare one, 
because the a priori likelihood of the disease being the common 
phenotype is higher than the likelihood for the rare phenotype. 
Therefore, use of positive and negative predictive values (PPV and 
NPV) may be more useful in the clinical setting. PPV is defined 
as the proportion of true positives out of all positive results as 
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FIGURE 3 I Posterior probability variation with number of predisposing 
loci. The density of posterior probabilities of disease (PPD) is shown under a 
simplified multilocus disease model. The relative risk of each independent, 
disease-predisposing SNP was set to 2.0. Prior probability of disease was set 
at 0.20. Frequency of the predisposing genotype in controls was set to 0.05 



at each SNP The number of predisposing loci was increased from 20 to 1000. 
Naive Bayes was used to calculate posterior probabilities. The data points 
only take on discrete values (the larger number of loci have many more data 
points reflecting the larger number of possible multilocus genotype 
combinations), but are presented with interconnecting lines. 



determined from applying a diagnostic test. Conversely, NPV is 
defined as the proportion of negative results that are indeed truly 
negative. However, a direct ROC analog of characterizing the 
tradeoff between PPV and NPV offers challenges. Motivated by 
this, Pencina et al. suggest that averaging over PPV and NPV 
may provide an improved metric for characterizing prognos- 
tic/diagnostic utility (Pencina et al, 2008). In 2006, a new method 
for characterizing disease predictions based on proportions of 
individuals accurately reclassified was presented (Cook et al., 
2006). This approach was further developed in subsequent pub- 
lications, describing the employment of the Hosmer-Lemeshow 
goodness-of-fit statistic and the net reclassification improvement 
statistic applied to reclassification categories as predictive mea- 
sures (Cook, 2007; Cook and Ridker, 2009). The authors applied 
these approaches to better specify results from cardiovascular risk 
models. 



Another approach would be to characterize the improvement 
in the distribution of posterior probabilities as compared to the 
distribution of prior probabilities, where the distribution is taken 
across all individuals evaluated. The more informative the genetic 
information becomes, the larger the departure between poste- 
rior and prior probability densities. A natural measure for this is 
the KuUback-Leibler Divergence, which quantifies the departure 
between two densities (KuUback and Leibler, 1951). Applied to 
characterizing the improvement in disease prediction following 
the interrogation of a suite of features such as genetic markers, 
the KuUback-Leibler Divergence is defined as: 



Dkl = P{Disease \ Gi, . . . , G„) log 



P{Disease \ G\, . . . , G„) 
P(Disease) 



where G, are the random variables describing the states of each 
genetic marker involved in disease susceptibility, and the sum is 
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FIGURE 4 I AUC. Tine figure sliows the ROC curve and corresponding area 
under tine ROC curve (AUC). The expected patterns under two extreme 
scenarios are shown: an ideal diagnostic scenario and the pattern expected 
using random predictions. 



over all possible multilocus genotype combinations. Dkl is calcu- 
lated across the entire population to whom the predictive model 
is applied. Larger values of Dkl indicate enhanced differences 
between the posterior and prior probabilities across the popu- 
lation, reflecting the greater utility of the genetic information. 
Hence, the KuUback-Leibler Divergence concurrently captures 
both the magnitude of the effect the genetic data have on the pos- 
terior probabilities for each individual (compared to the prior) 
and the proportion of tested individuals exhibiting each magni- 
tude of the effect. Empirical-based calibration of this or other 
measures of prognostic utility can often be accomplished through 
using well-studied data sets having standard prognostic tests such 
as the Framingham population and cardiovascular disease risk 
score (Wilson et al, 1998; Schrodi et al, 2009). 

Another possible method for characterizing would be to define 
some level of probability that is clinically meaningful for the spe- 
cific application. That is, define critical levels Tpos and r„eg such 
that exceeding these values with the posterior probability of dis- 
ease provides actionable information for a clinician. Define the 
conditions: 

Cii P{Disease \ Genotype Data, other features) > tpos 
Ci: P(pisease \ Genotype Data, other features) < x„eg. 
We explore the dynamics of Ci and C2 as a function of the 
prior probability of disease in Figure 5. The collective effect of 
100 disease-predisposing SNPs, each with relative risk 2.0 and 
genotype frequency 5%, is clearly not sufficient to concurrently 
generate high proportions of individuals who are well- classified 
as either being likely (Ci) or unlikely (C2) to have disease. At 
prior probabilities close to 0.50, the majority of individuals do 
not satisfy either condition Ci or C2. It is only in the situations 
where the prior probability is either close to 0 or 1 that large per- 
centages of the population interrogated achieve very high or very 
low posterior probabilities. Hence, with current results from dis- 
ease genetics, it seems reasonable to assume that a clinician should 
already have a strong suspicion either of a disease diagnosis or the 
exclusion of a disease to warrant the use of SNPs. 

To further explore these prognostic utility patterns, we con- 
sidered two simplified disease models: a highly polygenic model 
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FIGURE 5 I Effect of prior probability. The frequency of multilocus 
genotype combinations exceeding the Ci and C2 thresholds for posterior 
probabilities of disease (PPD) (set at 0.05 and 0.95, respectively) are 
presented as a function of the prior probability of disease. 100 predisposing 
SNPs were used in the model, each having a predisposing genotype 
frequency of 5% in controls and relative risk of 2.0. 



consisting of 1000 predisposing SNPs, each of appreciable fre- 
quency (10% in controls) (Figure 6). We set the prior probability 
of disease to 0.20. As the relative risk of each SNP is increased 
from 1.02 to 1.80, the C2 condition exhibits a sigmoidal pattern, 
climbing to over 80% roughly when the relative risk hits 1.45. 
In contrast, the Ci condition peaks at roughly the same relative 
risk and declines thereafter, but never exceeding 0.02. A typi- 
cal large GWAS experiment would be well-designed to identify 
the SNPs with relative risks in excess of roughly 1.1. The collec- 
tive effect from the 1000 SNPs is not sufficient to overcome the 
prior probability of 0.20 to promote frequent individual mul- 
tilocus genotype combinations to exceed the 0.95 threshold of 
Ci. That said, the proportion of individuals with posterior prob- 
abilities exceeding the C2 < 0.05 threshold was much higher. 
We explored a highly penetrant, rare allele model (Figure 7). We 
constructed this model with 100 predisposing single nucleotide 
variants (SNVs) with predisposing genotypes being rare (0.1%), 
offset by large effect sizes ranging from relative risks of 10 to 
400. Sequencing studies generate numerous SNVs. In each graph 
a single effect size was assumed for all SNVs. Again, the prior 
probability was set to 0.20. Here, the C1/C2 dynamics are more 
complex, with the Ci and C2 conditions being very sensitive to 
individual multilocus genotype combinations. Modeling a dis- 
tribution of SNV frequencies would smooth this type of graph. 
These are overly simplified cases examined here and the param- 
eter space is vast — additional work in this area would provide 
useful insights into the properties of prognostics that result from 
different genetic-based disease models. That said, the proportion 
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Highly Polygenic Model 




FIGURE 6 I Highly polygenic model. The dynamics of tfie C1/C2 thresliold 
values under a simplified model is shown as the relative risk of the SNPs 
varies. The highly polygenic model has 1000 predisposing SNPs each 
having predisposing genotype frequencies in controls equal to 10% and a 
prior probability equal to 0.20. The relative risk was varied from 1.02 to 1.80. 



of individuals satisfying Ci is dramatically higher than under 
the highly polygenic model. Further, high values of Ci and C2 
occur concurrently. Although much more work is needed to fully 
explore these dynamics, this observation may give some hope to 
the usefulness of rare, highly penetrant sequence variants in the 
context of disease prediction. However, one might have expected 
that the most common results of GWAS analyses — identification 
of large numbers of common variants each with small impact 
on disease risk for common diseases — would be more useful, 
unless there are other effects operating, such as considerable locus 
heterogeneity for common diseases. 

EXAMPLES 

GWAS DATA APPLIED TO A TYPE 2 DIABETES PROSPECTIVE COHORT 

Type 2 diabetes (T2D) is a common medical condition with 
rapidly increasing incidence worldwide (Zimmet et al., 2001). 
This disease is characterized by a multitude of abnormal patho- 
physiological states involving muted beta cell response, chronic 
inflammation, and aberrant levels of metabolic markers that 
ultimately lead to vascular damage, infection, heightened car- 
diovascular disease risk, and neuropathy (Zimmet et al, 2001). 
Numerous T2D GWAS have been conducted and have reliably 
identified new genes and genetic regions involved in T2D sus- 
ceptibility, albeit with modest effect sizes (McCarthy, 2010). 
Early prediction of T2D onset and trajectory can be leveraged 
into improving important medical decisions, including treatment 
with therapeutics, exercise programs, and diet restriction. It is 
possible that genetic variants may play a role in improving the 
prediction of T2D. To test this idea, Shigemizu et al. very recently 
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FIGURE 7 I Highily penetrant model. The highly penetrant model uses 
100 SNPs each having a predisposing genotype frequency of 0.1 % and also 
a prior probability of 0.20. The relative risk takes on values from 10 to 400. 
Although the highly polygenic model yields a large proportion of individuals 
with posterior probabilities below 0.05, the increasing relative risks have 
little impact on the proportion of individuals with posterior probabilities 
above 0.95. The highly penetrant model shows an overall increase in the 
proportions of individuals with posterior probabilities below 0.05 and above 
0.95, but the patterns are somewhat unexpected (not smooth, nor 
monotone). These patterns are generated from all predisposing SNVs 
having identical genotype frequencies and relative risks, coupled with 
having specific PPD thresholds. 



performed a two-stage study (training and test sets) that resulted 
in combining nine SNPs with three clinical risk factors (age, gen- 
der, and BMI) to develop a predictive model for T2D in a prospec- 
tive cohort having Japanese ancestry (Shigemizu et al., 2014). 
The features used in a regression model for T2D-prediction were 
selected using a Bayes Factor and lasso method. From both genetic 
and clinical risk factors, the resulting predictive model showed 
reasonable AUG values in the independent test set (AUG = 0.808). 
Further, where the sensitivity and sensitivity were concurrently 
maximized, the model yielded a PPV and NPV of 77.8 and 73.8%, 
respectively. Although the selected SNPs did add to the diagnos- 
tic performance of the prediction model, they only did so in an 
incremental fashion. The model using SNPs, interactions, and 
clinical risk factors exhibited a 1.5% increase in the AUG over 
the clinical risk factors alone. Perhaps the discovery of additional 
T2D risk variants from sequencing efforts, rarer exome variants, 
extensive epistatic models, and/or undiscovered epigenetic factors 
will drive further work in this area to markedly improve the per- 
formance of T2D predictive models using heritable information. 
Until then, there may be greater gains through the use of dynamic 
markers like metabolite profiling and proteomics. Alternatively, 
exploration of prediction within T2D subgroups may offer a more 
fruitful avenue of inquiry. 
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STROKE PREDICTION USING GENETIC RISK SCORES 

Stroke events are major contributors to mortality and morbid- 
ity, constituting the fourth leading cause of death in the United 
States. Accurate prediction of ischemic stroke risk would enable 
medical interventions to at least partially remediate stroke occur- 
rence and the resulting brain damage. Very recently, two large 
studies (Ibrahim-Verbaas et al., 2014; Malik et al., 2014) have been 
published evaluating risk models constructed from a number of 
stroke and related phenotype-associated GWAS SNPs. The results 
were consistent with GRSs achieving statistical significance, but 
adding little in diagnostic utility to clinical features, as measured 
by AUG. Ibrahim-Verbaas et al. evaluated the performance of 
a 324-SNP GRS in four population-based cohorts totaling over 
22,000 individuals in an effort to improve the discrimination of 
ischemic stroke over that generated from the Framingham Stroke 
Risk Score Model, age and sex. The SNPs were selected based 
on association with stroke -related phenotypes and a GRS con- 
structed using weights from the regression model used to test 
the disease association. ROG curves from the results for the study 
show that the weighted GRSs do not substantially add to the AUG 
over that achieved by the Framingham Stroke Risk Score and 
sex — the improvement in AUG from the GRS was approximately 
0.02 for all stroke as well as for ischemic stroke alone — although 
the AUG improvement was statistically significant. The authors 
also examined the impact of the GRS on the net reclassification 
index, showing statistically significant, but incremental improve- 
ment. Goncurrently published in the same issue of Stroke, Malik 
et al. presented similar results for their stroke GRS performance in 
comparison to clinical features using overlapping samples (Malik 
et al., 2014). The study showed increased stroke risk across quin- 
tiles of their GRS, obtained from an analysis of independent 
samples from the Wellcome Trust Gase Gontrol Gonsortium 2 and 
the METASTROKE consortium. Slightly under a 1.5 -fold increase 
in risk was found comparing the top quintUe to the middle quin- 
tile, and a > 2-fold increase comparing the lowest quintile to the 
top quintile. No significant improvement in the net reclassifica- 
tion was observed and the ROG curves with and without the 
GRS are virtually superimposable for a sample set composed of 
a clinical trial-based derivation sample set and the replication 
sample set. 

PREDICTION USING BIOBANK DATA 

Gurrent efforts to discover and employ genetic risk predictors 
across multiple health care systems include those of the Electronic 
Medical Records and Genomics (eMERGE) Network (Gottesman 
et al., 2013). The eMERGE Network has supported large-scale 
genotyping efforts in biobanked DNA samples linked to elec- 
tronic medical records. As such, a repository of genome-wide 
genetic data can be interrogated with respect to a vast amount 
of clinical information. One use of these data is to investigate how 
sets of genetic markers can stratify sample sets for the purpose 
of performing historical prospective studies. By analyzing longi- 
tudinal data, one can specify the sets of individuals to "foUow" 
from a point in time to test for association with various medical 
traits. In doing so, one can perform a prospective study relat- 
ing genotypes to the accumulation of various medical outcomes 
and laboratory values. This is an excellent venue for evaluating 



genetic-based predictive models. For example, suppose one con- 
structed a predictive model for myocardial infarction (MI) with 
existing literature findings and then assigned a predicted MI risk 
for each individual. One could then evaluate how the predicted 
risk was correlated with the actual conversion rate of non-MI 
individuals to MI disease states. One can also simultaneously per- 
form association testing between any combination of sequence 
variants and/or GWAS SNPs and prospectively occurring disease, 
for the purpose of discovering novel genotype-phenotype corre- 
lations. Notably, this type of experimental design is less subject to 
confounding effects when compared to retrospective case-control 
designs because a cohort-based design is less likely to impart bias 
from sample selection being correlated with genetic factors. As 
noted in a 2010 Institute of Medicine "Rapid Learning" docu- 
ment, the hope is that electronic medical records, biobanks and 
bioregistries will provide evidentiary support for intervention 
decisions (National Research Council, 2010). Interesting, Lauer 
and D'Agostino recently suggested that the next disruptive tech- 
nology in clinical research would be the randomized registry trial 
(Lauer and D'Agostino, 2013). 

Deeply phenotyped biobanked datasets can also be used to 
redefine disease states. GWAS have highlighted SNPs that are 
undoubtedly correlated with susceptibility to common diseases 
but, as we have discussed, the alleles discovered thus far explain 
only a marginal amount of disease heritability. The reasons for 
this are the subject of much debate. Resolution of this perplex- 
ing problem will likely involve a multitude of discoveries, not the 
least of which stem from addressing the opaque correspondence 
between clinical phenotypes and underlying molecular patholo- 
gies. Due to reliance on observations of complex, gross physiology 
in the clinic, it is reasonable to assume that there may be multi- 
ple molecular etiologies that map to a single clinical disease state 
(e.g., estrogen receptor status now meaningfully partitions pre- 
viously indistinguishable breast cancers and leads to profound 
changes in the use of Tamoxifen) (Fisher et al., 1988, 1989; Park 
et al., 2004). Conversely, single molecular perturbations may have 
pleiotropic effects (e.g., the rs2476601 SNP in PTPN22 is strongly 
associated with several, clinically distinct autoimmune diseases) 
(Begovich et al, 2004; Bottini et al, 2004; Kyogoku et al, 2004; 
Velaga et al, 2004; Ganton et al, 2005; GrisweU et al, 2005). The 
medical field is accustomed to defining diseases with regard to 
visual inspection and gross anatomical measurements, and there- 
fore may (1) aggregate disparate molecular pathophysiologies and 
(2) partition the same molecular processes into different disease 
classes. Indeed, there is not a one-to-one mapping between clini- 
cal assessments of disease and molecular processes. Thus, it seems 
reasonable to adopt the reductionist stance that redefining dis- 
ease states and processes in terms of the underlying genetic and 
molecular variation may significantly aid investigation of dis- 
ease etiologies. In this way, one can construct phenotype-based 
predictive models for sets of genetic/molecular information — a 
reverse genetics approach. Several groups have recently taken this 
approach to mapping disease genes: Pendergrass et al. used this 
method to interrogate data from the PAGE network (Pendergrass 
et al., 2013), Hebbring et al. (2013), have performed similar 
types of studies in the Marshfield Personalized Medicine Research 
Project samples, and Denny et al. utilized data from the eMERGE 
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Network (Denny et al., 2013). In these studies, clinical pheno- 
types are screened in electronic medical record (EMR) systems for 
association with specific genetic variants with known fiinction (or 
highly likely to have specific impact on biological pathways) — a 
method pioneered by Ritchie et al. in a large-scale effort to repli- 
cate numerous associations using DNA databanks linked to EMRs 
(Ritchie et al, 2010). Novel disease associations can be discovered 
through these "PheWAS" studies. In addition, this "bottom-up" 
(specific genetic variants-to-phenotype) approach can also be 
viewed as a starting point for using genetic information to rede- 
fine disease states in a classification system that more closely 
mirrors the underlying molecular pathophysiology. For example, 
screening diseases within a biobank for association with IL23R 
missense variants uncovers sets of disease phenotypes where aber- 
rant Thl7 signaling plays a pathogenic role. Autoinflammatory 
diseases, including ankylosing spondylitis, psoriasis, and Crohn's 
disease, would all show a common, core aspect to their molecular 
pathophysiology. Additionally, partitioning by these same vari- 
ants allows elucidation of disease subgroups. This reclassification 
can further enable disease prediction, for the phenotypes pre- 
dicted would exhibit clearer correspondence with the underlying 
molecular mechanisms. 

INFLAMMATORY ARTHRITIS PREDICTION IN THE MARSHFIELD 
POPULATION 

To illustrate how to apply machine learning methods to empiri- 
cal datasets for the purpose of disease prediction and show some 
of the difficulties with attaining strong predictive signals from 
GWAS findings, we present an example of using genetic data and 
samples from an EMR-linked biorepository for the purpose of 
distinguishing between inflammatory arthritis conditions. 

Worldwide and within the US, inflammatory arthritides are 
common conditions representing a substantial portion of dis- 
abling disease. Early treatment of these conditions can provide 
substantial benefit in averting disabling articular damage and 
systemic complications. In general, autoimmune and autoinflam- 
matory diseases such as rheumatoid arthritis and spondyloarthri- 
tides have significant heritabilities — a substantial portion of 
which has been explained by identified polymorphisms, thereby 
motivating the incorporation of genotype information into prog- 
nostic approaches. This study aimed to investigate and charac- 
terize the ability of a panel of genetic markers, identified from 
genome-wide association study results, to classify individuals into 
the three inflammatory arthritis categories: rheumatoid arthri- 
tis, axial spondyloarthritis and psoriatic arthritis (Table 1). Using 
genotyped samples from an independent sample set from Central 
Wisconsin (Marshfield Clinic), several machine learning meth- 
ods were applied to a filtered set of these polymorphisms to 
classify individuals into the three inflammatory arthritis dis- 
eases, blinded to the known disease status. The WEKA software 
package was used to implement the machine learning algo- 
rithms (Holmes et al, 1994; Hall et al, 2009). Accuracy was 
defined as the proportion of positive and negative classifica- 
tion results that were in fact true. The Naive Bayes classifier 
attained the highest average accuracy from 10-fold cross val- 
idation on the training set (Table 2). However, when applied 
to the Marshfield test set, there was a substantial decline in 



Table 1 | Illustration of machine learning methods applied to genetic 
data: feature selection. 



Ankylosing spondylitis Psoriatic arthritis 



Rheumatoid arthritis 



Rs13203464 (HLA-B27) 
Rs30187 (ERAP1) 
Rs11209026 (IL23R) 
Rs10865531 (2p15) 
Rs2310173 (IL1R2) 
Rs4333130 (ANTXR2) 
Rs378108 (21q22) 
Rs2297909 (KIF2B) 
Rs10045431 (IL12B) 
Rs10903118 (RUNX3) 
Rs7720838 (PTGER4) 
Rs2058276 (Y-marker) 



Rs10484554 (HLA-C) 
Rs20541 (IL13) 
Rs13017599 (RED 
Rs2066808 (IL23A) 
Rs12924903 (RUNX3) 
Rs4795067 (N0S2) 
Rs4379175 (IL12B) 
Rs4982254 (PSMA6) 
Rs13151961 (IL2/21) 
Rs11209026 (IL23R) 
Rs7720838 (PTGER4) 



Rs660895 (HLA-DRB1) 
Rs2476601 (PTPN22) 
Rs3761847 (TRAF1/C5) 
Rs3890745 (MM ELI) 
Rs13031237 (RED 
Rs7574865 (STAT4) 
Rs548234 (PRDM1) 
Rs2327832 (TNFAIP3) 
Rsl 569723 (CD40) 
Rs11574914 (CCL21) 
Rs11172254(KIF5A) 
Rs231804 (CTLA4) 
Rsl 160542 (AFF3) 
Rs13151961 (IL2/21) 



SNPs were selected for the Naive Bayes Algorithm to determine inflammatory 
arthritis categories. Feature selection for the SNPs was performed through a lit- 
erature search on the three diseases of interest from existing GWAS studies: 
Ankylosing spondylitis (AS), Psoriatic arthritis (PsA), and Rheumatoid arthritis 
(RA). The most significant and replicable SNPs were used to construct the 
entire set. Subsequent evaluation of features used Wrapper Subset Evaluation, 
ChlSq Attribute Evaluation, Classifier Subset Evaluation, and Information Gain 
from Attributes methods wflthin the program WEKA. These methods were cou- 
pled with a variety of search methods including genetic algorithm-based. Greedy 
Stepwise Selection, and Linear Forward Search methods within WEKA. 



Table 2 | Relative performance across machine learning methods. 



Algorithm 


Number 


Accuracy (%) 


10-fold CV 




of SNPs 




accuracy (%) 


Decision tree 


33 


87 


74.1 


Neural network 


5 


76 


771 


Logistic regression 


24 


78 


777 


Support vector machine 


1 


73 


72.6 


K-nearest neiglibor 


7 


77 


772 


Naive bayes 


29 


78 


779 



Six Machine Learning algorithms using feature selection and classification were 
evaluated for accuracy using complied data from the literature in a synthetic 
training data set. Effect sizes and genotype frequencies were estimated from 
the literature and incorporated into the synthetic data set to train the algorithms. 
Accuracy was defined as the proportion of individuals that were correctly pre- 
dicted (either true positives or true negatives). 10-fold Cross Validation (CV) was 
performed within WEKA and used as the criterion for selection of an algorithm 
to use In the test set from Marshfield. Naive Bayes exhibited slightly higher CV 
accuracy when compared to other algorithms, with a low amount of overfittlng. 

the performance with an average area under the ROC curve of 
0.635 (Figures). Although the difference between this observed 
AUC of 0.635 and that expected under the null (AUC = 0.500) 
is statistically significant, we conclude that additional, orthog- 
onal predictive variables, such as clinical features, circulating 
cytokine profiles or additional genetic variants, are necessary 
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Average Model Performance 




0.4 0.6 
False Positive Rate 



FIGURE 8 I AUG for inflammatory arthritis prediction study for the 
Marshfield population. Tlie Naive Bayes classifier developed using data 
from the literature was applied to the Marshfield population of 
inflammatory arthritis individuals: Rheumatoid arthritis (RA), Psoriatic 
arthritis (PsA), and Ankylosing spondylitis (AS). The model generated an 
AUG of 0.635, which was statistically significant via permutation. In 
addition, performance on randomized sample sets is shown in red, showing 
an expected null performance. 



to build a clinically useful prognostic test for classifying these 
diseases. 

SUMMARY 

We have summarized some of the seminal issues in utiliz- 
ing genetic information in predictive models for disease traits. 
Currently, the application of genetic-based predictive models to 
common diseases is, generally speaking, disappointing from both 
theoretical and empirical lines of evidence. There are some bright 
spots, including AMD, Crohn's disease, and special applications 
to selected populations with increased posterior probabilities 
due to non-genetic factors. Additionally, if the current wave of 
sequence-based disease gene mapping uncovers sufficient num- 
bers of highly penetrant alleles, then these may provide clinically 
relevant prognostic utility. Outside of common disease prognos- 
tics, tumor genetics, screening for inherited Mendelian disorders, 
and some pharmacogenetic applications have exhibited the most 
progress over the past five years. The reasons for this stem from 
the reduced complexity of the genetic architecture of these traits, 
yielding extremely high or extremely low posterior probabilities. 
Certainly, many questions in the field remain. As our understand- 
ing of the nature of elements that resolve the missing heritability 
problem matures, the path to applying predictive modeling meth- 
ods will become clearer. What needs to fall in place for clinically 
useful prediction of complex diseases? We speculate that six 
critical steps will aid this process: 

(1) Through next-generation sequencing platforms applied to 
both linkage and association designs, identification of 



additional susceptibility variants will fully cover the allele 
frequency spectrum and capture disease-predictive alleles. 
However, the discovery of rare, highly penetrant risk alle- 
les will be most useful as clinical sequencing becomes 
widespread and applied earlier in life. 

(2) As other elements besides DNA sequences are inherited and 
contribute to phenotypic variance, the interrogation of addi- 
tional possible contributors to heritability, including DNA 
methylation patterns, histone modifications, transgenera- 
tional effects, and other factors correlated with disease traits, 
will capture more of the molecularly-defined heritability. 

(3) Redefining disease phenotypes to more accurately mirror 
the underlying molecular pathophysiology will be critical 
in reducing disease complexity and better enable genetic 
susceptibility mapping. For example, partitioning diseases 
by molecular subtypes will identify physiological subgroups 
with clearer correspondence with the underlying genetics. 
Within the context of research using biobanks Hnked to med- 
ical records, relevant laboratory tests or imaging information, 
or both, would also be valuable. 

(4) Considerable progress has been made in the field of machine 
learning, where robust methods have been developed to select 
features and use them in predictive models. Applying these 
approaches to genetic data in combination with existing labo- 
ratory tests, imaging data, and other established medical tests 
will offer the best chance of creating viable prognostics. 

(5) Metrics that capture prognostic utility in a way that accu- 
rately reflects what a clinician requires to inform medical 
decisions will be developed. 

(6) The application of disease predictive models to diverse clin- 
ical populations will clarify the performance and limitations 
of proposed predictive models and improve medical practice. 

In summary, while prediction will continue to be challenging, 
future investigations promise to provide a wealth of informa- 
tion, some of which will be clinically useful if considered in the 
appropriate context. 
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